 Rheumatology


Takeda's psoriasis pill developed with AI assistance succeeds in trials

The Japan Times

Psoriasis is a chronic autoimmune disorder that causes itchy, scaly rashes and afflicts more than 125 million people worldwide. Takeda Pharmaceutical announced that its oral psoriasis drug zasocitinib proved safe and effective in late-stage trials, marking a milestone in its effort to treat the incurable skin condition and offset looming revenue pressure. Patients with moderate-to-severe plaque psoriasis who took the once-daily pill showed significantly clearer skin compared with those on placebo or the existing therapy apremilast, the company said in a statement Thursday. Takeda plans to submit data to the U.S. Food and Drug Administration and other regulators beginning in fiscal year 2026. If approved, zasocitinib would join the small but growing group of oral psoriasis treatments, in a market long dominated by ointments and injectable antibody therapies, and stand out as one of the first drugs discovered with the help of artificial intelligence.


Classification of autoimmune diseases from Peripheral blood TCR repertoires by multimodal multi-instance learning

Zhang, Ruihao, Chen, Mao, Ye, Fei, Meng, Dandan, Huang, Yixuan, Liu, Xiao

arXiv.org Artificial Intelligence

T cell receptor (TCR) repertoires encode critical immunological signatures for autoimmune diseases, yet their clinical application remains limited by sequence sparsity and low witness rates. We developed EAMil, a multi-instance deep learning framework that leverages TCR sequencing data to diagnose systemic lupus erythematosus (SLE) and rheumatoid arthritis (RA) with exceptional accuracy. By integrating Prime-Seq feature extraction with ESMonehot encoding and enhanced gate attention mechanisms, our model achieved state-of-the-art performance with AUCs of 98.95% for SLE and 97.76% for RA. EAMil successfully identified disease-associated genes with over 90% concordance with established differential analyses and effectively distinguished disease-specific TCR genes. The model demonstrated robustness in classifying multiple disease categories, using the SLEDAI score to stratify SLE patients by disease severity and to diagnose the site of damage in SLE patients, while effectively controlling for confounding factors such as age and gender. This interpretable framework for immune receptor analysis provides new insights for autoimmune disease detection and classification with broad potential clinical applications across immune-mediated conditions.
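The abstract does not spell out how EAMil's enhanced gate attention works, so the following is only a minimal sketch of plain gated-attention pooling as it is commonly used in attention-based multi-instance learning: each instance embedding (here standing in for one TCR sequence) gets a score from a tanh branch multiplied by a sigmoid gate, the scores are softmax-normalized, and the bag embedding (one patient repertoire) is the weighted sum. The function name, the weight matrices `V`, `U`, `w`, and the toy values are all hypothetical and not taken from the paper.

```python
import math


def _matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_attention_pool(instances, V, U, w):
    """Pool instance embeddings into one bag embedding.

    score_k = w . (tanh(V h_k) * sigmoid(U h_k)); weights = softmax(scores).
    Returns (attention weights, bag embedding).
    """
    scores = []
    for h in instances:
        tanh_branch = [math.tanh(z) for z in _matvec(V, h)]
        gate_branch = [_sigmoid(z) for z in _matvec(U, h)]
        scores.append(sum(wi * t * g for wi, t, g in zip(w, tanh_branch, gate_branch)))

    # Numerically stable softmax over the instances in the bag.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Bag embedding: attention-weighted sum of the instance embeddings.
    dim = len(instances[0])
    bag = [sum(a * h[d] for a, h in zip(weights, instances)) for d in range(dim)]
    return weights, bag


# Toy bag of three 2-dimensional instance embeddings (hypothetical values).
instances = [[0.1, 0.2], [0.9, 0.8], [0.4, 0.5]]
V = [[0.5, -0.3], [0.2, 0.7]]
U = [[0.1, 0.4], [-0.6, 0.3]]
w = [1.0, -1.0]
weights, bag = gated_attention_pool(instances, V, U, w)
```

Because the weights are attached to individual instances, they can be inspected after training to see which sequences drove a prediction, which is the usual route to interpretability in this family of models.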


Generative Medical Event Models Improve with Scale

Waxler, Shane, Blazek, Paul, White, Davis, Sneider, Daniel, Chung, Kevin, Nagarathnam, Mani, Williams, Patrick, Voeller, Hank, Wong, Karen, Swanhorst, Matthew, Zhang, Sheng, Usuyama, Naoto, Wong, Cliff, Naumann, Tristan, Poon, Hoifung, Loza, Andrew, Meeker, Daniella, Hain, Seth, Shah, Rahul

arXiv.org Artificial Intelligence

Realizing personalized medicine at scale calls for methods that distill insights from longitudinal patient journeys, which can be viewed as a sequence of medical events. Foundation models pretrained on large-scale medical event data represent a promising direction for scaling real-world evidence generation and generalizing to diverse downstream tasks. Using Epic Cosmos, a dataset with medical events from de-identified longitudinal health records for 16.3 billion encounters over 300 million unique patient records from 310 health systems, we introduce the Curiosity models, a family of decoder-only transformer models pretrained on 118 million patients representing 115 billion discrete medical events (151 billion tokens). We present the largest scaling-law study of medical event data, establishing a methodology for pretraining and revealing power-law scaling relationships for compute, tokens, and model size. Consequently, we pretrained a series of compute-optimal models with up to 1 billion parameters. Conditioned on a patient's real-world history, Curiosity autoregressively predicts the next medical event to simulate patient health timelines. We studied 78 real-world tasks, including diagnosis prediction, disease prognosis, and healthcare operations. Remarkably for a foundation model with generic pretraining and simulation-based inference, Curiosity generally outperformed or matched task-specific supervised models on these tasks, without requiring task-specific fine-tuning or few-shot examples. Curiosity's predictive power consistently improves as the model and pretraining scale. Our results show that Curiosity, a generative medical event foundation model, can effectively capture complex clinical dynamics, providing an extensible and generalizable framework to support clinical decision-making, streamline healthcare operations, and improve patient outcomes.
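As a rough illustration of what a power-law scaling relationship means in practice, the sketch below fits a curve of the form loss = a * compute^b by ordinary least squares in log-log space. The data points are synthetic and generated inside the example; they are not results from the Curiosity study, and the function name is hypothetical.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**b by linear least squares on (log x, log y)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((x - mean_x) ** 2 for x in log_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    b = sxy / sxx                      # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)  # intercept in log-log space = log(a)
    return a, b


# Synthetic example: losses generated from a known power law (a=5.0, b=-0.05).
compute = [1e15, 1e16, 1e17, 1e18]
loss = [5.0 * c ** -0.05 for c in compute]
a_hat, b_hat = fit_power_law(compute, loss)
```

On real training runs the points scatter around the line, and the fitted exponent is what lets one extrapolate how much loss should drop for a given increase in compute, tokens, or parameters.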


Rethinking Retrieval-Augmented Generation for Medicine: A Large-Scale, Systematic Expert Evaluation and Practical Insights

Kim, Hyunjae, Sohn, Jiwoong, Gilson, Aidan, Cochran-Caggiano, Nicholas, Applebaum, Serina, Jin, Heeju, Park, Seihee, Park, Yujin, Park, Jiyeong, Choi, Seoyoung, Contreras, Brittany Alexandra Herrera, Huang, Thomas, Yun, Jaehoon, Wei, Ethan F., Jiang, Roy, Colucci, Leah, Lai, Eric, Dave, Amisha, Guo, Tuo, Singer, Maxwell B., Koo, Yonghoe, Adelman, Ron A., Zou, James, Taylor, Andrew, Cohan, Arman, Xu, Hua, Chen, Qingyu

arXiv.org Artificial Intelligence

Large language models (LLMs) are transforming the landscape of medicine, yet two fundamental challenges persist: keeping up with rapidly evolving medical knowledge and providing verifiable, evidence-grounded reasoning. Retrieval-augmented generation (RAG) has been widely adopted to address these limitations by supplementing model outputs with retrieved evidence. However, whether RAG reliably achieves these goals remains unclear. Here, we present the most comprehensive expert evaluation of RAG in medicine to date. Eighteen medical experts contributed a total of 80,502 annotations, assessing 800 model outputs generated by GPT-4o and Llama-3.1-8B across 200 real-world patient and USMLE-style queries. We systematically decomposed the RAG pipeline into three components: (i) evidence retrieval (relevance of retrieved passages), (ii) evidence selection (accuracy of evidence usage), and (iii) response generation (factuality and completeness of outputs). Contrary to expectation, standard RAG often degraded performance: only 22% of top-16 passages were relevant, evidence selection remained weak (precision 41-43%, recall 27-49%), and factuality and completeness dropped by up to 6% and 5%, respectively, compared with non-RAG variants. Retrieval and evidence selection remain key failure points for the model, contributing to the overall performance drop. We further show that simple yet effective strategies, including evidence filtering and query reformulation, substantially mitigate these issues, improving performance on MedMCQA and MedXpertQA by up to 12% and 8.2%, respectively. These findings call for re-examining RAG's role in medicine and highlight the importance of stage-aware evaluation and deliberate system design for reliable medical LLM applications.
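The evidence-selection figures above are precision and recall over passages. A minimal sketch of that computation, assuming each passage has an identifier and that experts have marked which retrieved passages are relevant (the function name and passage IDs are hypothetical, and the study's exact annotation protocol may differ):

```python
def selection_precision_recall(selected, relevant):
    """Precision and recall of the passages a model actually used.

    selected: IDs of passages the model drew on in its answer.
    relevant: IDs of passages experts judged relevant to the query.
    """
    selected, relevant = set(selected), set(relevant)
    true_positives = len(selected & relevant)
    precision = true_positives / len(selected) if selected else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: the model used four passages, two of which were relevant,
# and it missed one relevant passage entirely.
precision, recall = selection_precision_recall(
    selected=["p1", "p2", "p3", "p4"],
    relevant=["p2", "p4", "p7"],
)
```

Low precision means the answer leaned on irrelevant passages; low recall means relevant evidence was retrieved but ignored. Scoring retrieval, selection, and generation separately is what allows a failure to be traced to one stage.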


Topic-aware Large Language Models for Summarizing the Lived Healthcare Experiences Described in Health Stories

Bilalpur, Maneesh, Hamm, Megan, Lee, Young Ji, Norman, Natasha, McTigue, Kathleen M., Wang, Yanshan

arXiv.org Artificial Intelligence

Storytelling is a powerful form of communication and may provide insights into factors contributing to gaps in healthcare outcomes. To determine whether Large Language Models (LLMs) can identify potential underlying factors and avenues for intervention, we performed topic-aware hierarchical summarization of narratives from African American (AA) storytellers. Fifty transcribed stories of AA experiences were used to identify topics in their experience using the Latent Dirichlet Allocation (LDA) technique. Stories about a given topic were summarized using an open-source LLM-based hierarchical summarization approach. Topic summaries were generated by summarizing across story summaries for each story that addressed a given topic. Generated topic summaries were rated for fabrication, accuracy, comprehensiveness, and usefulness by the GPT4 model, and the model's reliability was validated against the original story summaries by two domain experts. Twenty-six topics were identified in the fifty AA stories. The GPT4 ratings suggest that topic summaries were free from fabrication, highly accurate, comprehensive, and useful. The reliability of GPT ratings compared to expert assessments showed moderate to high agreement. Our approach identified AA experience-relevant topics such as health behaviors, interactions with medical team members, caregiving and symptom management, among others. Such insights could help researchers identify potential factors and interventions by learning from unstructured narratives in an efficient manner, leveraging the communicative power of storytelling. The use of LDA and LLMs to identify and summarize the experience of AA individuals suggests a variety of possible avenues for health research and possible clinical improvements to support patients and caregivers, thereby ultimately improving health outcomes.


ProtoMedX: Towards Explainable Multi-Modal Prototype Learning for Bone Health Classification

Pellicer, Alvaro Lopez, Mariucci, Andre, Angelov, Plamen, Bukhari, Marwan, Kerns, Jemma G.

arXiv.org Artificial Intelligence

Bone health studies are crucial in medical practice for the early detection and treatment of Osteopenia and Osteoporosis. Clinicians usually make a diagnosis based on densitometry (DEXA scans) and patient history. The application of AI in this field is an area of ongoing research. Most of the successful methods for this task include Deep Learning models that rely on vision alone (DEXA / X-ray imagery) geared towards high prediction accuracy, where explainability is disregarded and largely based on the post hoc assessment of input contributions. We propose ProtoMedX, a multi-modal model that uses both DEXA scans of the lumbar spine and patient records. ProtoMedX's prototype-based architecture is explainable by design, crucial for medical applications, especially in the context of the upcoming EU AI Act, as it allows explicit analysis of the model's decisions, especially the ones that are incorrect. ProtoMedX demonstrates state-of-the-art performance in bone health classification while also providing explanations that can be visually understood by clinicians. Using our dataset of 4,160 real NHS patients, the proposed ProtoMedX achieves 87.58% accuracy in vision-only tasks and 89.8% in its multi-modal variant, both approaches surpassing existing published methods.


NurseLLM: The First Specialized Language Model for Nursing

Khondaker, Md Tawkat Islam, Harrington, Julia, Shehata, Shady

arXiv.org Artificial Intelligence

Recent advancements in large language models (LLMs) have significantly transformed medical systems. However, their potential within specialized domains such as nursing remains largely underexplored. In this work, we introduce NurseLLM, the first nursing-specialized LLM tailored for multiple choice question-answering (MCQ) tasks. We develop a multi-stage data generation pipeline to build the first large scale nursing MCQ dataset to train LLMs on a broad spectrum of nursing topics. We further introduce multiple nursing benchmarks to enable rigorous evaluation. Our extensive experiments demonstrate that NurseLLM outperforms SoTA general-purpose and medical-specialized LLMs of comparable size on different benchmarks, underscoring the importance of a specialized LLM for the nursing domain. Finally, we explore the role of reasoning and multi-agent collaboration systems in nursing, highlighting their promise for future research and applications.


H-DDx: A Hierarchical Evaluation Framework for Differential Diagnosis

Lim, Seungseop, Kim, Gibaeg, Lee, Hyunkyung, Han, Wooseok, Seo, Jean, Yoo, Jaehyo, Yang, Eunho

arXiv.org Artificial Intelligence

An accurate differential diagnosis (DDx) is essential for patient care, shaping therapeutic decisions and influencing outcomes. Recently, Large Language Models (LLMs) have emerged as promising tools to support this process by generating a DDx list from patient narratives. However, existing evaluations of LLMs in this domain primarily rely on flat metrics, such as Top-k accuracy, which fail to distinguish between clinically relevant near-misses and diagnostically distant errors. To mitigate this limitation, we introduce H-DDx, a hierarchical evaluation framework that better reflects clinical relevance. H-DDx leverages a retrieval and reranking pipeline to map free-text diagnoses to ICD-10 codes and applies a hierarchical metric that credits predictions closely related to the ground-truth diagnosis. In benchmarking 22 leading models, we show that conventional flat metrics underestimate performance by overlooking clinically meaningful outputs, with our results highlighting the strengths of domain-specialized open-source models. Furthermore, our framework enhances interpretability by revealing hierarchical error patterns, demonstrating that LLMs often correctly identify the broader clinical context even when the precise diagnosis is missed.
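To illustrate why a hierarchical metric can credit near-misses that Top-k accuracy scores as zero, here is a minimal sketch that is not the H-DDx metric itself. It assumes the predicted and ground-truth diagnoses are already mapped to ICD-10 codes, approximates the hierarchy by code prefixes (three-character category, then each longer subcode), and computes set-based precision, recall, and F1 over the ancestor-augmented code sets. All function names are hypothetical.

```python
def icd10_ancestors(code):
    """Return a code plus its prefix ancestors, e.g. 'E11.65' -> E11, E11.6, E11.65."""
    category, _, rest = code.strip().upper().partition(".")
    nodes = {category}
    for i in range(1, len(rest) + 1):
        nodes.add(f"{category}.{rest[:i]}")
    return nodes


def hierarchical_prf(predicted, truth):
    """Set-based precision, recall, and F1 over ancestor-augmented code sets."""
    pred_nodes = set().union(*(icd10_ancestors(c) for c in predicted)) if predicted else set()
    true_nodes = set().union(*(icd10_ancestors(c) for c in truth)) if truth else set()
    overlap = len(pred_nodes & true_nodes)
    precision = overlap / len(pred_nodes) if pred_nodes else 0.0
    recall = overlap / len(true_nodes) if true_nodes else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


# Near miss: same category (I21), different subcode. Exact match would score 0.
near_miss = hierarchical_prf(["I21.9"], ["I21.4"])
# Distant error: a code from an unrelated category shares no ancestors.
distant = hierarchical_prf(["J18.9"], ["I21.4"])
```

Under this sketch the near miss earns partial credit through the shared category while the distant error earns none, which is the distinction flat metrics cannot express. A fuller implementation would also use ICD-10 chapters and blocks, which cannot be derived from code prefixes alone.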


Machine Learning Meets Transparency in Osteoporosis Risk Assessment: A Comparative Study of ML and Explainability Analysis

Elias, Farhana, Reza, Md Shihab, Mahmud, Muhammad Zawad, Islam, Samiha, Alve, Shahran Rahman

arXiv.org Artificial Intelligence

The present research tackles the difficulty of predicting osteoporosis risk via machine learning (ML) approaches, emphasizing the use of explainable artificial intelligence (XAI) to improve model transparency. Osteoporosis is a significant public health concern, sometimes remaining untreated owing to its asymptomatic characteristics, and early identification is essential to avert fractures. The research assesses six machine learning classifiers (Random Forest, Logistic Regression, XGBoost, AdaBoost, LightGBM, and Gradient Boosting) and utilizes a dataset based on clinical, demographic, and lifestyle variables. The models are refined using GridSearchCV to calibrate hyperparameters, with the objective of enhancing predictive efficacy. XGBoost had the greatest accuracy (91%) among the evaluated models, surpassing others in precision (0.92), recall (0.91), and F1-score (0.90). The research further integrates XAI approaches, such as SHAP, LIME, and Permutation Feature Importance, to elucidate the decision-making process of the optimal model. The study indicates that age is the primary determinant in forecasting osteoporosis risk, followed by hormonal alterations and familial history. These results corroborate clinical knowledge and affirm the models' therapeutic significance. The research underscores the significance of explainability in machine learning models for healthcare applications, guaranteeing that physicians can rely on the system's predictions. The report ultimately proposes directions for further research, such as validation across varied populations and the integration of supplementary biomarkers for enhanced predictive accuracy.


Beyond Jailbreaking: Auditing Contextual Privacy in LLM Agents

Das, Saswat, Sandler, Jameson, Fioretto, Ferdinando

arXiv.org Artificial Intelligence

LLM agents have begun to appear as personal assistants, customer service bots, and clinical aides. While these applications deliver substantial operational benefits, they also require continuous access to sensitive data, which increases the likelihood of unauthorized disclosures. Moreover, such leakage goes beyond explicit disclosure, leaving open avenues for gradual manipulation or side-channel information leakage. This study proposes an auditing framework for conversational privacy that quantifies an agent's susceptibility to these risks. The proposed Conversational Manipulation for Privacy Leakage (CMPL) framework is designed to stress-test agents that enforce strict privacy directives against an iterative probing strategy. Rather than focusing solely on a single disclosure event or purely explicit leakage, CMPL simulates realistic multi-turn interactions to systematically uncover latent vulnerabilities. Our evaluation on diverse domains, data modalities, and safety configurations demonstrates the auditing framework's ability to reveal privacy risks that are not deterred by existing single-turn defenses, along with an in-depth longitudinal study of the temporal dynamics of leakage, strategies adopted by adaptive adversaries, and the evolution of adversarial beliefs about sensitive targets. In addition to introducing CMPL as a diagnostic tool, the paper delivers (1) an auditing procedure grounded in quantifiable risk metrics and (2) an open benchmark for evaluation of conversational privacy across agent implementations.